Who is this for?

Most helpful if you:

  • Use a variety of modeling methods: linear models, generalized linear models, LASSO, trees, random forest, etc.
  • Are familiar with cross-validation.
  • Use the caret package.
  • Are comfortable with the tidyverse.

Also useful if you:

  • Have used some modeling techniques, like lm().
  • Are excited about learning machine learning.
  • Love gardening!

What will we cover?

Bird’s eye view of my garden, Image Credit: Google Maps

Follow along in the .Rmd file.

Libraries

The libraries we will use:

library(tidyverse)         # for reading in data, graphing, and cleaning
library(lubridate)         # for date manipulation
library(tidymodels)        # for modeling
library(moderndive)        # for King County housing data
library(vip)               # for variable importance plots
theme_set(theme_minimal()) # my favorite ggplot2 theme :)

Like the tidyverse, tidymodels is a collection of packages:

tidymodels_packages()
##  [1] "broom"         "cli"           "crayon"        "dials"        
##  [5] "dplyr"         "ggplot2"       "infer"         "magrittr"     
##  [9] "parsnip"       "pillar"        "purrr"         "recipes"      
## [13] "rlang"         "rsample"       "rstudioapi"    "tibble"       
## [17] "tidytext"      "tidypredict"   "tidyposterior" "tune"         
## [21] "workflows"     "yardstick"     "tidymodels"

The data

According to the house_prices documentation, "This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015."

data("house_prices")

house_prices %>% 
  slice(1:10)

We will model the home price using the other variables in the dataset.

Exploration

Overview of modeling process

Data Splitting

set.seed(327) # for reproducibility

# Randomly assigns 75% of the data to training.
house_split <- initial_split(house_prices, 
                             prop = .75)
house_split
## <16210/5403/21613>
# <training/testing/total>

house_training <- training(house_split)
house_testing <- testing(house_split)
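As a quick sanity check (not in the original code), we can confirm that the sizes of the training and testing sets match the <16210/5403/21613> printout above:

```r
nrow(house_training)  # 16210 rows, i.e. 75% of 21613 (rounded)
nrow(house_testing)   # 5403 rows, the remaining 25%
```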

Data Splitting

Later, we will use 5-fold cross-validation to evaluate the model and tune model parameters.

set.seed(1211) # for reproducibility
house_cv <- vfold_cv(house_training, v = 5)
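To see the fold structure, we can count the analysis (training) and assessment (held-out) rows in each resample. This check is not part of the original code; analysis() and assessment() come from rsample and map_int() from purrr, both loaded above:

```r
house_cv %>% 
  mutate(n_analysis   = map_int(splits, ~ nrow(analysis(.x))),
         n_assessment = map_int(splits, ~ nrow(assessment(.x))))
```

Each fold holds out roughly one fifth of the training data, and every row appears in exactly one assessment set.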

Data preprocessing and recipe

A variety of step_xxx() functions can be used for data pre-processing/transforming. The full list is in the recipes package reference.
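The house_recipe object used below is not defined in this excerpt; the workflow printout later lists its six steps (step_rm(), step_log(), step_mutate(), step_rm(), step_date(), step_dummy()). Here is a sketch of what such a recipe could look like. The specific variables chosen are illustrative assumptions, not the original choices:

```r
house_recipe <- recipe(price ~ ., data = house_training) %>% 
  # Drop predictors we don't want in the model (illustrative choice)
  step_rm(sqft_living15, sqft_lot15) %>% 
  # Log-transform the right-skewed response
  step_log(price, base = 10) %>% 
  # Derive a new variable (illustrative: flag homes with a basement)
  step_mutate(basement = as.numeric(sqft_basement > 0)) %>% 
  # Drop the raw variable the flag was derived from
  step_rm(sqft_basement) %>% 
  # Extract the sale month from the date
  step_date(date, features = "month") %>% 
  # Dummy-encode all nominal predictors
  step_dummy(all_nominal(), -all_outcomes())
```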

Apply

Apply the recipe to the training dataset, just to see what happens. Notice the names of the variables.

house_recipe %>% 
  prep(house_training) %>%
  juice() 

Defining the model

In order to define our model, we need to do these steps:

  • Define the model type, which is the general type of model you want to fit.
  • Set the engine, which defines the package/function that will be used to fit the model.
  • Set the mode, which is either “regression” for continuous response variables or “classification” for binary/categorical response variables. (Note that for linear regression, it can only be “regression”, so we don’t NEED this step in this case.)
  • (OPTIONAL) Set arguments to tune. We’ll see an example of this later.

All available model functions are listed in the parsnip package reference; see linear_reg() for the details of linear regression.

house_linear_mod <- 
  # Define a linear regression model
  linear_reg() %>% 
  # Set the engine to "lm" (lm() function is used to fit model)
  set_engine("lm") %>% 
  # Not necessary here, but good to remember for other models
  set_mode("regression")

Creating a workflow

This combines the preprocessing and model definition steps.

house_lm_wf <- 
  # Set up the workflow
  workflow() %>% 
  # Add the recipe
  add_recipe(house_recipe) %>% 
  # Add the modeling
  add_model(house_linear_mod)

house_lm_wf
## ══ Workflow ═════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ─────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## ● step_rm()
## ● step_log()
## ● step_mutate()
## ● step_rm()
## ● step_date()
## ● step_dummy()
## 
## ── Model ────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Modeling

We first use the fit() function to fit the model, telling it which dataset we want to fit the model to. Then we use pull_workflow_fit() and tidy() to display the estimated coefficients nicely.

house_lm_fit <- 
  # Tell it the workflow
  house_lm_wf %>% 
  # Fit the model to the training data
  fit(house_training)

# Display the results nicely
house_lm_fit %>% 
  pull_workflow_fit() %>% 
  tidy() %>% 
  mutate_if(is.numeric, ~ round(.x, 3))

Model evaluation

To evaluate the model, we will use cross-validation (CV), specifically 5-fold CV. First, let’s take a moment to review (or learn!) what 5-fold CV means.

Image credit: https://bradleyboehmke.github.io/HOML/process.html#resampling

Evaluate model (code)

house_lm_fit_cv <-
  # Tell it the workflow
  house_lm_wf %>% 
  # Fit the model (using the workflow) to the cv data
  fit_resamples(house_cv)

# The evaluation metrics for each fold:
house_lm_fit_cv %>% 
  select(id, .metrics) %>% 
  unnest(.metrics)
# Evaluation metrics averaged over all folds:
collect_metrics(house_lm_fit_cv)